[Kernels][MoE] Fix legacy_routing to use bitmatrix-based routing path #38504

tjtanaa merged 10 commits into vllm-project:main
Conversation
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request updates the MoE routing logic to handle bitmatrix-based routing and introduces a mechanism in the scheduler to defer block freeing during asynchronous KV transfers. Specifically, it adds a _deferred_block_req_ids set to track these requests and ensures the engine continues stepping while transfers are pending. I have no feedback to provide.
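For context, here is a minimal sketch of the deferral idea the review describes; everything beyond `finished_req_ids` and `_deferred_block_req_ids` is a hypothetical simplification, and the real vLLM scheduler is considerably more involved:

```python
# Hypothetical, simplified sketch -- not the actual vLLM scheduler.
class Scheduler:
    def __init__(self) -> None:
        self.finished_req_ids: set[str] = set()
        # Requests whose KV blocks cannot be freed yet because an
        # asynchronous KV transfer is still in flight.
        self._deferred_block_req_ids: set[str] = set()

    def finish_request(self, req_id: str, kv_transfer_pending: bool) -> None:
        if kv_transfer_pending:
            self._deferred_block_req_ids.add(req_id)  # defer the block free
        else:
            self.finished_req_ids.add(req_id)

    def on_kv_transfer_done(self, req_id: str) -> None:
        # Transfer complete: the request's blocks may be freed next step.
        self._deferred_block_req_ids.discard(req_id)
        self.finished_req_ids.add(req_id)

    def has_finished_requests(self) -> bool:
        # Deferred requests keep the engine stepping until their frees land.
        return bool(self.finished_req_ids) or bool(self._deferred_block_req_ids)
```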
```diff
 def has_finished_requests(self) -> bool:
-    return len(self.finished_req_ids) > 0
+    return len(self.finished_req_ids) > 0 or len(self._deferred_block_req_ids) > 0
```
Yep, you're right, I reverted that.
…outing Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
```python
if sm_first:
    logits = torch.softmax(logits, dim=-1)
sparse_logits = topk(logits, n_expts_act, apply_softmax=not sm_first)
return legacy_routing_from_sparsematrix(
```
After you remove this, `legacy_routing_from_sparsematrix` is no longer used anywhere. Let's remove it.
Moreover, please disclose the accuracy of the models.
I am not familiar with this part of the code.
I removed that legacy routing logic. I also enabled the GPQA eval tests for GPT-OSS accuracy.
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Does it affect performance?

@tjtanaa Actually I was about to cite the tests regarding accuracy, but I think I've got to rebase first to get clean logs: https://buildkite.com/vllm/amd-ci/builds/7267/steps/canvas?sid=019d553e-778b-452a-bde4-8882ae4cdcf2&tab=output
…0 eval configs Signed-off-by: Andreas Karatzas <akaratza@amd.com>
GPQA accuracy
| Config | main | PR branch |
|---|---|---|
| mxfp4-bf16-aiter | PASSED (0.5739) | PASSED (0.5600) |
| mxfp4-bf16-triton | PASSED (0.5701) | PASSED (0.5726) |
| mxfp4-fp8-triton | PASSED (0.5562) | PASSED (0.5593) |
| baseline | PASSED (0.5707) | PASSED (0.5530) |
Serving benchmark (amd/gpt-oss-20b-w-mxfp4-a-bf16, TP=2, gfx950, 512 prompts, rate=10)
1024 in / 1024 out
| Metric | main triton | main aiter | PR triton | PR aiter |
|---|---|---|---|---|
| Successful / Failed requests | 512 / 0 | 512 / 0 | 512 / 0 | 512 / 0 |
| Output token throughput (tok/s) | 392.62 | 540.17 | 467.22 | 527.19 |
| Mean TTFT (ms) | 69.90 | 38.70 | 80.01 | 37.85 |
| Mean TPOT (ms) | 109.03 | 54.68 | 123.46 | 54.23 |
| Request throughput (req/s) | 6.86 | 9.46 | 8.15 | 9.53 |
8192 in / 8192 out
| Metric | main triton | main aiter | PR triton | PR aiter |
|---|---|---|---|---|
| Successful / Failed requests | 512 / 0 | 512 / 0 | 512 / 0 | 512 / 0 |
| Output token throughput (tok/s) | 309.76 | 337.94 | 252.00 | 472.75 |
| Mean TTFT (ms) | 628.07 | 347.65 | 699.16 | 343.38 |
| Mean TPOT (ms) | 329.76 | 162.47 | 389.91 | 164.78 |
| Request throughput (req/s) | 4.05 | 4.63 | 4.04 | 7.26 |
…vllm-project#38504) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Jacob Lou <jacoblou0924@gmail.com>
…vllm-project#38504) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Rishi Puri <riship@nvidia.com>
…vllm-project#38504) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Three issues prevented `test_gpt_oss_triton_kernels` and GPQA serving from working on gfx950 (CDNA4) after the legacy routing deprecation:

1. **`pack_bitmatrix` sets bit 31 spuriously on HIP:** Triton on HIP uses C-style integer division: `-1 // 32 == 0` (not `-1` as in Python). The masked-out lanes loaded with `other=-1` produce `div=0`, and `1 << (-1 % 32)` is undefined behavior that sets bit 31 on AMD. This causes expert 31 to appear in every bitmatrix row, corrupting the `SparseMatrix` routing metadata and dropping valid token-expert pairs. On the serving path this results in GPU memory access faults that crash the server. Fixed by adding a `valid = indices >= 0` guard. On CUDA this is a no-op since Python-style floor division gives `div=-1`, which never matches any column offset.
2. **Test padding alignment too small for CDNA4:** `CDNA4MXScaleLayout.swizzle_data` reshapes with `SCALE_K // 8`, requiring `K % 256 == 0`. The test used `round_up(K, 64)`, which is only sufficient on Hopper (where `HopperMXScaleLayout` pads internally). Made the alignment platform-conditional: ROCm uses 256/512, matching the production values in `mxfp4_round_up_hidden_size_and_intermediate_size`; CUDA keeps the original 64/128. (A small numeric illustration follows this list.)
3. **gfx950 GPQA eval configs missing tokenizer and TP:** The `amd/gpt-oss-20b-w-mxfp4-a-bf16` model repo declares `tokenizer_class=TokenizersBackend`, which is not a valid HuggingFace tokenizer, causing server launch failure. Added `--tokenizer openai/gpt-oss-20b`. Also added `--tensor-parallel-size 2` to all ROCm configs to prevent GPU memory faults with `enforce-eager`.
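To make the alignment arithmetic in issue 2 concrete, a small illustration; `round_up` and the choice of `K` here are assumptions for the example, not the test's actual values:

```python
def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple (illustrative helper)."""
    return ((x + multiple - 1) // multiple) * multiple

K = 2880  # example hidden size, already a multiple of 64
assert round_up(K, 64) == 2880 and 2880 % 256 != 0   # fine on Hopper, breaks CDNA4
assert round_up(K, 256) == 3072 and 3072 % 256 == 0  # satisfies CDNA4's reshape
```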
### Root cause detail

`pack_bitmatrix` loads `topk_ids` with `other=-1` for out-of-bounds lanes. On HIP/ROCm, C-style division maps the `-1` sentinel onto word 0 with an out-of-range shift amount (sketched below). This silently inserts expert 31 into the bitmatrix for every token, which corrupts `col_sum` (expert 31 gets count `n_rows` instead of 0), causing `col_sorted_indx` to have `-1` entries that drop valid token-expert pairs from the computation. During serving, these corrupted indices lead to out-of-bounds memory accesses that crash the GPU.
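Below is a minimal plain-Python sketch of the failing arithmetic and the guard; the `pack_row` helper is hypothetical and only emulates what the Triton kernel does per row (on AMD hardware the out-of-range shift wraps, which is what effectively sets bit 31):

```python
idx = -1  # sentinel loaded for masked-out lanes (other=-1)

# Python / CUDA Triton semantics: floor division, non-negative modulo.
assert idx // 32 == -1   # word index -1 never matches any column offset
assert idx % 32 == 31

# HIP Triton semantics: C-style division truncates toward zero.
div_c = int(idx / 32)     # 0  -> the sentinel lane targets word 0
rem_c = idx - div_c * 32  # -1 -> `1 << -1` is undefined behavior; on AMD
                          #       the shift wraps and sets bit 31

# The fix: mask invalid lanes before packing so sentinels contribute nothing.
def pack_row(indices: list[int], n_words: int = 2) -> list[int]:
    words = [0] * n_words
    for i in indices:
        if i >= 0:  # the added `valid = indices >= 0` guard
            words[i // 32] |= 1 << (i % 32)
    return words

assert pack_row([3, 33, -1, -1]) == [1 << 3, 1 << 1]
```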
### Test plan

- `test_gpt_oss_triton_kernels.py`: 5/5 passed (was 1 passed, 4 failed)

### [LEGACY] Initial Test plan and Motivation
`legacy_routing` used `legacy_routing_from_sparsematrix`, which extracted `col_sorted_indx` directly from the `topk()` `SparseMatrix`, producing indices that differ from the bitmatrix-reconstructed path (missing `-1` padding for empty expert slots). In this PR we align `legacy_routing` with the same code path used by `triton_kernel_moe_forward` and `make_routing_data`: unpack the `topk()` results and route through `legacy_routing_from_bitmatrix` (tuple/ROCm) or `make_routing_data` (`SparseMatrix`/NVIDIA), as sketched below.
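A rough sketch of the aligned path; the signatures of `topk`, `legacy_routing_from_bitmatrix`, and `make_routing_data` are simplified assumptions (the real helpers in the triton-kernels MoE integration take more arguments):

```python
import torch
# topk, legacy_routing_from_bitmatrix, and make_routing_data are assumed to
# be importable from the triton-kernels MoE integration (simplified here).

def legacy_routing(logits: torch.Tensor, n_expts_act: int, sm_first: bool = False):
    if sm_first:
        logits = torch.softmax(logits, dim=-1)
    routed = topk(logits, n_expts_act, apply_softmax=not sm_first)
    if isinstance(routed, tuple):
        # ROCm build: topk() returns unpacked tensors; rebuild the routing
        # metadata through the bitmatrix-based path.
        return legacy_routing_from_bitmatrix(*routed)
    # NVIDIA build: topk() returns a SparseMatrix; reuse make_routing_data so
    # legacy_routing matches the triton_kernel_moe_forward code path.
    return make_routing_data(routed)
```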
### Test plan

`pytest tests/kernels/moe/test_gpt_oss_triton_kernels.py::test_legacy_routing -v`

Motivation: https://buildkite.com/vllm/amd-ci/builds/7062/steps/canvas?jid=019d382d-cbac-41c4-a53f-f4f56244488e&tab=output
cc @kenroche